Group Project¶

In [1]:
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(GGally)
library(AER)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘cowplot’


The following object is masked from ‘package:lubridate’:

    stamp


Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Loading required package: car

Loading required package: carData


Attaching package: ‘car’


The following object is masked from ‘package:dplyr’:

    recode


The following object is masked from ‘package:purrr’:

    some


Loading required package: lmtest

Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Loading required package: sandwich

Loading required package: survival

Demonstrate that the dataset can be read from the web into R.

In [2]:
data <- read.csv(url("https://raw.githubusercontent.com/SiyingLiu03/GroupProjectSTAT301/main/HR-Employee-Attrition.csv"))
head(data)
A data.frame: 6 × 35 (first six rows of the raw data; the 35 columns run from Age, Attrition, BusinessTravel, DailyRate, Department, DistanceFromHome, Education, EducationField, EmployeeCount, EmployeeNumber, … through RelationshipSatisfaction, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager — wide output omitted)

The dataset contains information about each employee, including personal details and their status in the company. Its focus is attrition status, and it can be used to help HR decide whether to try to retain an employee. There are 1470 observations and 35 variables in the dataset. Below is a description of the dataset.

| Variable | Type | Description |
|---|---|---|
| Age | continuous | Age of employee |
| Attrition | categorical | Whether the employee has left the company |
| BusinessTravel | categorical | Frequency of business travel |
| DailyRate | continuous | Amount of pay per day |
| Department | categorical | Department of employee |
| DistanceFromHome | continuous | Distance from home to work in miles |
| Education | categorical | Education level |
| EducationField | categorical | Field of study |
| EmployeeCount | continuous | Count of employees |
| EmployeeNumber | categorical | Employee ID number |
| EnvironmentSatisfaction | categorical | Employee satisfaction at work (from 1 to 4) |
| Gender | categorical | Gender of employee |
| HourlyRate | continuous | Employee hourly rate |
| JobInvolvement | categorical | Level of job involvement |
| JobLevel | categorical | Level of seniority |
| JobRole | categorical | Employee's role in the company |
| JobSatisfaction | categorical | Job satisfaction of employee |
| MaritalStatus | categorical | Marital status of employee |
| MonthlyIncome | continuous | Monthly income of an employee |
| MonthlyRate | continuous | Monthly rate of an employee |
| NumCompaniesWorked | continuous | Number of companies the employee has worked at before |
| Over18 | categorical | Whether the employee is over 18 |
| OverTime | categorical | Whether the employee works overtime |
| PercentSalaryHike | continuous | Percent salary hike |
| PerformanceRating | categorical | Performance rating |
| RelationshipSatisfaction | categorical | Relationship satisfaction |
| StandardHours | continuous | Standard work hours per week |
| StockOptionLevel | categorical | Company stock option level |
| TotalWorkingYears | continuous | Total working years |
| TrainingTimesLastYear | continuous | Number of trainings last year |
| WorkLifeBalance | categorical | Balance between work and life |
| YearsAtCompany | continuous | Total years at the current company |
| YearsInCurrentRole | continuous | Total years in the current role |
| YearsSinceLastPromotion | continuous | Number of years since the last promotion |
| YearsWithCurrManager | continuous | Years working under the current manager |

Question: How is attrition (our response variable) affected by the explanatory variables Age, Education, EducationField, DailyRate, HourlyRate, JobSatisfaction, TotalWorkingYears, MonthlyRate, Gender, OverTime, PerformanceRating, YearsAtCompany, and YearsSinceLastPromotion? The question is focused on prediction (selecting a model that can predict attrition status).

Data Cleaning¶

Clean and wrangle the data into a tidy format.

In [3]:
# Convert the categorical variables in the dataset into factors
data$Attrition <- factor(data$Attrition)
data$BusinessTravel <- factor(data$BusinessTravel)
data$Department <- factor(data$Department)
data$EducationField <- factor(data$EducationField)
data$EmployeeCount <- factor(data$EmployeeCount)
data$EnvironmentSatisfaction <- factor(data$EnvironmentSatisfaction)
data$Gender <- factor(data$Gender)
data$Education <- factor(data$Education)
data$JobInvolvement <- factor(data$JobInvolvement)
data$JobLevel <- factor(data$JobLevel)
data$PerformanceRating <- factor(data$PerformanceRating)
data$JobRole <- factor(data$JobRole)
data$JobSatisfaction <- factor(data$JobSatisfaction)
data$MaritalStatus <- factor(data$MaritalStatus)
data$Over18 <- factor(data$Over18)
data$OverTime <- factor(data$OverTime)
data$RelationshipSatisfaction <- factor(data$RelationshipSatisfaction)
data$StockOptionLevel <- factor(data$StockOptionLevel)
data$WorkLifeBalance <- factor(data$WorkLifeBalance)

head(data)
A data.frame: 6 × 35 (same six rows as above, with the categorical columns now typed as &lt;fct&gt; instead of &lt;chr&gt;/&lt;int&gt; — wide output omitted)
In [4]:
#Remove rows with missing values
hrdata <- na.omit(data)
In [5]:
#Get the number of observations we have
n <- nrow(hrdata)
n
1470
In [6]:
# Get column names
column_names <- colnames(hrdata)
column_names
  1. 'Age'
  2. 'Attrition'
  3. 'BusinessTravel'
  4. 'DailyRate'
  5. 'Department'
  6. 'DistanceFromHome'
  7. 'Education'
  8. 'EducationField'
  9. 'EmployeeCount'
  10. 'EmployeeNumber'
  11. 'EnvironmentSatisfaction'
  12. 'Gender'
  13. 'HourlyRate'
  14. 'JobInvolvement'
  15. 'JobLevel'
  16. 'JobRole'
  17. 'JobSatisfaction'
  18. 'MaritalStatus'
  19. 'MonthlyIncome'
  20. 'MonthlyRate'
  21. 'NumCompaniesWorked'
  22. 'Over18'
  23. 'OverTime'
  24. 'PercentSalaryHike'
  25. 'PerformanceRating'
  26. 'RelationshipSatisfaction'
  27. 'StandardHours'
  28. 'StockOptionLevel'
  29. 'TotalWorkingYears'
  30. 'TrainingTimesLastYear'
  31. 'WorkLifeBalance'
  32. 'YearsAtCompany'
  33. 'YearsInCurrentRole'
  34. 'YearsSinceLastPromotion'
  35. 'YearsWithCurrManager'

Then, select the variables we are interested in and rename the columns. Because the dataset has more than 30 variables, it would be complicated to use them all for model selection. Here we pre-select some of them as explanatory variables for the further analysis.

In [7]:
hrdata <- hrdata |>
        select(Age,Attrition,Education,EducationField, DailyRate, HourlyRate, JobSatisfaction, TotalWorkingYears, MonthlyRate,
              Gender, OverTime, 
               PerformanceRating, YearsAtCompany, YearsSinceLastPromotion)
hrdata <- hrdata %>%
  rename(age = Age,
         attrition = Attrition,
         edu = Education,
         hourly = HourlyRate,
         satisfaction = JobSatisfaction,
         wrkyrs = TotalWorkingYears,
         monthly = MonthlyRate,
         edufield = EducationField,
         daily = DailyRate,
         gender = Gender,
         overtime = OverTime,
         perrating = PerformanceRating,
         yratcom = YearsAtCompany,
         yrsincepromotion = YearsSinceLastPromotion
        )
head(hrdata)
A data.frame: 6 × 14

| | age | attrition | edu | edufield | daily | hourly | satisfaction | wrkyrs | monthly | gender | overtime | perrating | yratcom | yrsincepromotion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 41 | Yes | 2 | Life Sciences | 1102 | 94 | 4 | 8 | 19479 | Female | Yes | 3 | 6 | 0 |
| 2 | 49 | No | 1 | Life Sciences | 279 | 61 | 2 | 10 | 24907 | Male | No | 4 | 10 | 1 |
| 3 | 37 | Yes | 2 | Other | 1373 | 92 | 3 | 7 | 2396 | Male | Yes | 3 | 0 | 0 |
| 4 | 33 | No | 4 | Life Sciences | 1392 | 56 | 3 | 8 | 23159 | Female | Yes | 3 | 8 | 3 |
| 5 | 27 | No | 1 | Medical | 591 | 40 | 2 | 6 | 16632 | Male | No | 3 | 2 | 2 |
| 6 | 32 | No | 2 | Life Sciences | 1005 | 79 | 4 | 8 | 11864 | Male | No | 3 | 7 | 3 |

Visualization¶

Propose a visualization that I consider relevant to address my question or to explore the data.

Now that the data are tidy, we can visualize the relationships between variables.

We need to see which variables are relevant and detect potential problems in the model.

First, we look at the variables that are potentially associated with the response variable (attrition).

To see whether there is a difference or any correlation between age and attrition status, we make a boxplot.

In [8]:
#Age vs Attrition
age_attrition_boxplots <- hrdata %>%
   ggplot() +
   geom_boxplot(aes(x = attrition, y = age)) +
   theme(
     text = element_text(size = 22),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   ) +
   ggtitle("Boxplot of Age vs Attrition") +
   xlab("Attrition") +
   ylab("Age of employees")
age_attrition_boxplots
[Figure: boxplot of age vs attrition]

The median age of employees with attrition is lower than that of employees without attrition, which may indicate that younger employees are more likely to leave. The two attrition groups have a similar age range.

To view the count of each gender among employees with and without attrition.

In [9]:
age_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = gender)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and Gender', x = 'Attrition', y = 'Count') +
  theme_minimal()

age_attrition_plot
[Figure: bar chart of attrition counts by gender]

In both attrition groups, there are more male employees than female employees.

To view the distribution of education levels within each attrition status.

In [10]:
edu_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = edu)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and Education level', x = 'Attrition', y = 'Count') +
  theme_minimal()

edu_attrition_plot
[Figure: bar chart of attrition counts by education level]

The plot shows that education level 3 is the most common in both attrition groups, and both groups have a similar distribution of education levels.

To view the distribution of education fields within each attrition status.

In [11]:
edufield_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = edufield)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and Education Field', x = 'Attrition', y = 'Count') +
  theme_minimal()

edufield_attrition_plot
[Figure: bar chart of attrition counts by education field]

The company prefers to keep employees who are in the life sciences and marketing fields.

To visualize the overtime variable against attrition status.

In [12]:
overTime_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = overtime)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and Overtime', x = 'Attrition', y = 'Count') +
  theme_minimal()

overTime_attrition_plot
[Figure: bar chart of attrition counts by overtime]

It can be seen that employees with attrition work overtime more often than those without, which may be one reason they leave the company.

To view daily rate against attrition status, we make a boxplot.

In [13]:
daily_attrition_boxplots <- hrdata %>%
   ggplot() +
   geom_boxplot(aes(x = attrition, y = daily)) +
   theme(
     text = element_text(size = 22),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   ) +
   ggtitle("Boxplot of Daily rate vs Attrition") +
   xlab("Attrition") +
   ylab("Daily rate of employees")
daily_attrition_boxplots
[Figure: boxplot of daily rate vs attrition]

The two attrition groups have similar daily rates.

To compare years at the company across attrition status, we make a boxplot to see if there is a difference between employees with and without attrition.

In [14]:
yratcom_attrition_boxplots <- hrdata %>%
   ggplot() +
   geom_boxplot(aes(x = attrition, y = yratcom)) +
   theme(
     text = element_text(size = 22),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   ) +
   ggtitle("Boxplot of years at company vs Attrition") +
   xlab("Attrition") +
   ylab("years of employees at company")
yratcom_attrition_boxplots
[Figure: boxplot of years at company vs attrition]

From the boxplot, the median years at the company is lower for employees with attrition. One possible explanation is that employees who are not a good fit leave (or are let go) early, while the company retains suitable employees over the long term.

Visualize the performance rating variable versus attrition

In [15]:
perrating_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = perrating)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and performance ', x = 'Attrition', y = 'Count') +
  theme_minimal()

perrating_attrition_plot
[Figure: bar chart of attrition counts by performance rating]

Visualize the year since last promotion variable versus attrition

In [16]:
yrsincepromotion_attrition_boxplots <- hrdata %>%
   ggplot() +
   geom_boxplot(aes(x = attrition, y = yrsincepromotion)) +
   theme(
     text = element_text(size = 22),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   ) +
   ggtitle("Boxplot of year since promotion vs Attrition") +
   xlab("Attrition") +
   ylab("years since last promotion")
yrsincepromotion_attrition_boxplots
[Figure: boxplot of years since last promotion vs attrition]

The two groups have a similar number of years since their last promotion.

To see whether there is any correlation between years at the company and years since last promotion, we make a scatter plot colored by attrition. This shows how the two variables jointly relate to attrition status; for example, an employee might leave because they have worked at the company for a long time without being promoted.

In [17]:
ggplot(hrdata, aes(x = yratcom, y = yrsincepromotion, fill = attrition)) +
  geom_point(shape = 21, size = 3) +  # Use shape 21 for filled circles
  labs(title = "years at company vs years since last promotion", x = "years at company", y = "years since last promotion")
[Figure: scatter plot of years at company vs years since last promotion, colored by attrition]

Since we need to consider multicollinearity, we make a ggpairs plot and a correlation matrix to visualize and analyse it.

Explore the correlation between numerical variables.

In [18]:
options(repr.plot.width = 15, repr.plot.height = 12)
num_hrdata <- hrdata |> select(age,attrition,daily,yratcom,yrsincepromotion, hourly, wrkyrs,monthly)

hrdata_pair_plots <- num_hrdata %>%
   select(- attrition) %>% 
   ggpairs(progress = FALSE) +
   theme(
     text = element_text(size = 15),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   )
 hrdata_pair_plots

corr_matrix_HRdata <- num_hrdata %>%
   select(- attrition) %>% 
   cor() %>%
   as.data.frame() %>%
   rownames_to_column("var1") %>%
   pivot_longer(-var1, names_to = "var2", values_to = "corr")
 head(corr_matrix_HRdata)
A tibble: 6 × 3

| var1 | var2 | corr |
|---|---|---|
| age | age | 1.00000000 |
| age | daily | 0.01066094 |
| age | yratcom | 0.31130877 |
| age | yrsincepromotion | 0.21651337 |
| age | hourly | 0.02428654 |
| age | wrkyrs | 0.68038054 |
[Figure: ggpairs plot of the numeric variables]
In [19]:
# Based on the given table, we conduct a correlation matrix.
options(repr.plot.width = 15, repr.plot.height = 10)

 plot_corr_matrix_HRdata<- corr_matrix_HRdata %>%
   ggplot(aes(x = var1, y = var2)) +
   geom_tile(aes(fill = corr), color = "white") +
   scale_fill_distiller("Correlation Coefficient \n",
     palette =  "YlOrRd",
     direction = 1, limits = c(-1,1)
   ) +
   labs(x = "Variable 1", y = "Variable 2") +
   theme_minimal() +
   theme(
     axis.text.x = element_text(
       angle = 45, vjust = 1,
       size = 18, hjust = 1
     ),
     axis.text.y = element_text(
       vjust = 1,
       size = 18, hjust = 1
     ),
     legend.title = element_text(size = 18, face = "bold"),
     legend.text = element_text(size = 18),
     legend.key.size = unit(2, "cm")
   ) +
   coord_fixed() +
   geom_text(aes(x = var1, y = var2, label = round(corr, 2)), color = "black", size = 6)
 plot_corr_matrix_HRdata
[Figure: correlation heatmap of the numeric variables]

As the matrix shows, several variables are highly correlated (for example, age and wrkyrs at 0.68), indicating potential multicollinearity that will be considered in the next step of this project.

Assignment 3: Method and Plan¶

After the exploratory data analysis and visualization of the variables (done above), and before building a predictive model, it is essential to split the dataset into training and test sets. This step ensures that the model can be trained on one set of data and evaluated on another, independent set to assess its performance. I will prepare the data for splitting by encoding categorical variables, handling missing values, and selecting relevant features that contribute to predicting attrition. Variable selection is important, and we can use the variance inflation factor (VIF) to deal with multicollinearity: a smaller VIF indicates less multicollinearity.
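As a sketch of this plan (assuming the cleaned hrdata from above; the 70/30 split proportion and the seed are arbitrary illustrative choices, not fixed by the project):

```r
# Reproducible 70/30 train/test split of the cleaned data
set.seed(301)  # arbitrary seed, chosen only for reproducibility
n <- nrow(hrdata)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train_set <- hrdata[train_idx, ]
test_set  <- hrdata[-train_idx, ]

# Check multicollinearity with VIF (car::vif, already loaded via AER);
# for factor predictors car reports a generalized VIF, and values far
# above roughly 5-10 are a common warning sign
full_fit <- glm(attrition ~ ., data = train_set, family = binomial)
vif(full_fit)
```

Predictors with a large VIF (e.g. wrkyrs and yratcom, which the correlation matrix already flagged) would be candidates for removal before fitting the final model.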

Since attrition is a categorical variable with two categories (Yes and No), to address the question of interest, which is predicting employee attrition, we can use a logistic regression model. Logistic regression is a commonly used statistical method for binary classification problems, such as predicting the probability of an employee leaving (attrition = Yes) or staying (attrition = No). Since the function in R that fits a logistic regression requires a numerical response, we construct a response $Y_i$ such that $Y_i = 1$ if the $i$th employee has attrition status "Yes", and $Y_i = 0$ otherwise. It is usually inappropriate to model a probability directly with a linear model, so we still use a linear component but do not model the conditional expectation directly. With some algebra, we can show that:

$$ \mbox{logit}(p_i) = \log \bigg( \frac{p_i}{1 - p_i}\bigg) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_{q} X_{iq}, $$

From the odds, we can estimate the ratio of the number of successes to the number of failures.
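Inverting the logit above gives the predicted probability explicitly, which is how the fitted model turns the linear component into a prediction:

$$ p_i = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \ldots + \beta_q X_{iq})}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \ldots + \beta_q X_{iq})}. $$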

We will then build the model and interpret its coefficients.
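A minimal sketch of fitting and interpreting such a model (the subset of predictors here is illustrative only; glm() with family = binomial treats the second factor level of attrition, "Yes", as the success):

```r
# Fit a logistic regression of attrition on a few of the pre-selected predictors
logit_fit <- glm(attrition ~ age + overtime + wrkyrs + yrsincepromotion,
                 data = hrdata, family = binomial)

# Tidy coefficient table (broom is loaded above); exponentiating a
# coefficient gives an odds ratio: exp(beta_j) is the multiplicative
# change in the odds of attrition for a one-unit increase in X_j,
# holding the other predictors fixed
tidy(logit_fit) |>
  mutate(odds_ratio = exp(estimate))
```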

Why is this method appropriate?¶

Logistic regression is appropriate for this problem because:

Binary Outcome: The target variable, attrition, is binary, which fits the logistic regression model's design for predicting binary outcomes.

Interpretability: Logistic regression coefficients can be interpreted to understand the impact of each feature on the likelihood of attrition. This is useful for identifying factors that contribute to employee attrition.

Efficiency: Logistic regression is computationally efficient and can handle a large number of features, making it suitable for this dataset with multiple explanatory variables.

Assumptions:¶

The main assumptions required for logistic regression include:

Linearity: The relationship between the logit of the outcome and the independent variables is linear.

Independence: Observations are independent of each other.

No Multicollinearity: Independent variables should not be too highly correlated with each other.

Large Sample Size: Logistic regression requires a sufficient number of observations for reliable results.

Potential Limitations or Weaknesses:¶

Linearity Assumption: If the relationship between the predictors and the log odds of the outcome is not linear, the model's performance might be compromised.

Outliers and Influential Points: Logistic regression can be sensitive to outliers and influential points, which can affect the model's accuracy.

Overfitting: With a large number of features, there is a risk of overfitting the model to the training data, which can reduce its generalizability to new data.

In summary, logistic regression is a suitable method for predicting employee attrition due to its appropriateness for binary outcomes, interpretability, and efficiency. However, it is important to consider its assumptions and potential limitations when applying it to the dataset.
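To guard against overfitting, a model chosen on the training set can be evaluated on the held-out test set, for example (assuming train_set and test_set from a prior split; the predictor subset and the 0.5 cutoff are conventional illustrative choices):

```r
# Fit on the training set only
fit <- glm(attrition ~ age + overtime + wrkyrs,
           data = train_set, family = binomial)

# Predicted probabilities of attrition = "Yes" on the held-out test set
test_probs <- predict(fit, newdata = test_set, type = "response")

# Classify with a 0.5 cutoff and compare against the true labels
test_labels <- ifelse(test_probs > 0.5, "Yes", "No")
table(predicted = test_labels, actual = test_set$attrition)

# Overall test-set accuracy
mean(test_labels == test_set$attrition)
```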

In [ ]: